Update Unicode tables to 9.0 #34599

Merged
merged 1 commit into from
Jul 15, 2016

cuviper
Member

@cuviper cuviper commented Jul 1, 2016

I just updated unicode.py's generated copyright year, then ran it.

@rust-highfive
Collaborator

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

@eddyb
Member

eddyb commented Jul 1, 2016

r? @alexcrichton

@rust-highfive rust-highfive assigned alexcrichton and unassigned eddyb Jul 1, 2016
@alexcrichton
Member

cc @SimonSapin, @rust-lang/libs

Thoughts on the backwards compatibility implications of a change like this? This seems like something we'd want to do, although if it has bad implications we may want to think it through first.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jul 1, 2016
@cuviper
Member Author

cuviper commented Jul 1, 2016

For reference, here are the official Unicode 9.0.0 changes, and the Migration section in particular should help evaluate compatibility. I don't know enough about how these properties are used to answer myself.

@brson
Contributor

brson commented Jul 1, 2016

When we've discussed Unicode and compatibility in the past, I recall we've leaned toward giving ourselves leeway to upgrade. @cuviper based on the Unicode changelog, do you know what impact this has on specific Rust functions? If this makes changes to, e.g., Unicode identifiers (which it looks like it does), that impacts the Rust language definition.

@brson brson added the relnotes Marks issues that should be documented in the release notes of the next release. label Jul 1, 2016
@cuviper
Member Author

cuviper commented Jul 1, 2016

@brson Yes, there are changes to the XID tables. I believe these are mostly additions for the new scripts, but I'm not sure of that. The UAX #31 Migration talks about changing the formal definitions of ID/XID, which isn't clear to me either, but I think it's just changing emphasis.

@nagisa
Member

nagisa commented Jul 1, 2016

To the best of my knowledge, no public Unicode functionality in the libraries shipped with rustc, or in the compiler itself, would be impacted by the move to 9.0.

  • We do not do NFKC normalisation for our identifiers, and AFAIR our XID tables were already explicit.
  • Adding new scripts is not breaking unless somebody relied on previously unassigned symbols becoming a replacement character in certain cases; that would no longer happen for newly assigned codepoints.

The only thing we might want to do is check our “easily confused symbols” table thing and see if it needs adjustment for the new codepoints (doubtful about it).
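
The second bullet above can be illustrated with a minimal sketch. It uses Python's standard `unicodedata` module as a stand-in, so it reflects the Unicode version bundled with the Python interpreter, not the tables in rustc_unicode; the helper name `is_assigned` is hypothetical. It shows how the answer to "is this codepoint assigned?" depends on the table version, taking a codepoint from the Adlam script (added in Unicode 9.0) as the example:

```python
import unicodedata


def is_assigned(ch):
    """True if this interpreter's Unicode database assigns the code point."""
    # General_Category "Cn" means "not assigned" (it also covers noncharacters).
    return unicodedata.category(ch) != "Cn"


# The Unicode version baked into this interpreter; the results below
# depend on it, which is exactly the compatibility question in this thread.
print("Unicode data version:", unicodedata.unidata_version)

# U+1E900 ADLAM CAPITAL LETTER ALIF belongs to a script added in Unicode 9.0.
adlam_alif = "\U0001E900"
print("U+1E900 assigned:", is_assigned(adlam_alif))
# Python's identifier check is based on XID_Start/XID_Continue.
print("U+1E900 valid as an identifier:", adlam_alif.isidentifier())
```

With tables older than 9.0 the codepoint would be reported as unassigned; with 9.0 or newer tables it is a letter. Code that only accepts more input after the upgrade is unlikely to break existing users.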

@nagisa
Member

nagisa commented Jul 1, 2016

To the PR author: you might need to adjust the script further for new properties and similar changes.

To the future reviewers: make sure the tables related to new properties are indeed correct and exhaustive.

@cuviper
Member Author

cuviper commented Jul 2, 2016

@nagisa There is one new property, Prepended_Concatenation_Mark, but it looks like unicode.py is already not exhaustive, only loading a few "interesting" properties:

props = load_properties("PropList.txt",
        ["White_Space", "Join_Control", "Noncharacter_Code_Point", "Pattern_White_Space"])

And then only White_Space and Pattern_White_Space are actually written to tables.rs.
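
As a rough sketch of why a new property is simply ignored by this kind of loader: the function below is not the actual `load_properties` from unicode.py, but a hypothetical `parse_properties` that assumes the usual `PropList.txt` line layout (`XXXX[..YYYY] ; Property_Name # comment`) and keeps only the requested names. The sample lines are a hand-picked excerpt in that layout.

```python
import re

# Matches "0009..000D ; White_Space # ..." as well as "0020 ; White_Space # ...".
_LINE = re.compile(r"^\s*([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def parse_properties(lines, interesting):
    """Collect inclusive (lo, hi) code point ranges for the requested properties."""
    props = {name: [] for name in interesting}
    for line in lines:
        m = _LINE.match(line)
        if m is None:
            continue  # comment or blank line
        lo_hex, hi_hex, name = m.groups()
        if name not in props:
            continue  # a property we were not asked for, e.g. a brand-new one
        lo = int(lo_hex, 16)
        hi = int(hi_hex, 16) if hi_hex else lo
        props[name].append((lo, hi))
    return props


SAMPLE = """\
# PropList-style sample
0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
200C..200D    ; Join_Control # Cf   [2] ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
06DD          ; Prepended_Concatenation_Mark # Cf       ARABIC END OF AYAH
""".splitlines()

props = parse_properties(SAMPLE, ["White_Space", "Join_Control"])
for name, ranges in props.items():
    print(name, [(hex(lo), hex(hi)) for lo, hi in ranges])
```

Because the filter is an allow-list, the `Prepended_Concatenation_Mark` line is parsed but never stored, so adding that property in 9.0 does not change the generated output unless the list of names is extended.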

So it seems rustc_unicode is already not trying to represent the entire Unicode standard. Is there anything in particular that you think needs to be added explicitly?

@nagisa
Member

nagisa commented Jul 2, 2016

I was more worried about this:

The constraints on standardized variation sequences have been relaxed slightly, to allow a spacing combining mark (General_Category = Spacing_Mark) as the initial character of a variation sequence. Nonspacing combining marks and canonical decomposable characters are still disallowed in variation sequences. Implementations should be checked for any assumptions regarding the allowed General_Category property values for the initial characters in variation sequences.

but it seems fine too, since we do not load that either.

@cuviper
Member Author

cuviper commented Jul 11, 2016

Is there anything waiting on me here? Or is this just waiting for a review decision?

@alexcrichton
Member

@cuviper ah no it's all on our end, the libs team just needs to discuss this basically. (would love to get @SimonSapin's thoughts as well)

@SimonSapin
Contributor

In general I’m in favor of keeping up to date with Unicode. Hard-coding a Unicode version was one of the big issues of IDNA 2003. And I believe the Unicode Consortium to be mindful of backward compatibility when making changes. http://www.unicode.org/policies/policies.html talks about stability.

And for what it’s worth, a second-hand story from The Olden Days:

https://tools.ietf.org/html/rfc3629#section-5

ISO/IEC 10646 is updated from time to time by publication of amendments and additional parts; similarly, new versions of the Unicode standard are published over time. Each new version obsoletes and replaces the previous one, but implementations, and more significantly data, are not updated instantly.

In general, the changes amount to adding new characters, which does not pose particular problems with old data. In 1996, Amendment 5 to the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded the Korean Hangul block, thereby making any previous data containing Hangul characters invalid under the new version. Unicode 2.0 has the same difference from Unicode 1.1. The justification for allowing such an incompatible change was that there were no major implementations and no significant amounts of data containing Hangul. The incident has been dubbed the "Korean mess", and the relevant committees have pledged to never, ever again make such an incompatible change (see Unicode Consortium Policies [1]).


That said I haven’t looked at all at what changed in 9.0. http://unicode.org/versions/Unicode9.0.0/#Migration would be the thing to review.

@alexcrichton
Member

Ok, cool, thanks @SimonSapin! I suspect that @rust-lang/libs will probably all respond with "lgtm" @cuviper

@brson
Contributor

brson commented Jul 13, 2016

lgtm

@aturon
Member

aturon commented Jul 13, 2016

I'm happy to delegate to the experts here, so lgtm :)

@alexcrichton
Member

@bors: r+ 452e4ed

@bors
Contributor

bors commented Jul 15, 2016

⌛ Testing commit 452e4ed with merge 3e15fcc...

bors added a commit that referenced this pull request Jul 15, 2016
Update Unicode tables to 9.0

I just updated `unicode.py`'s generated copyright year, then ran it.
@bors bors merged commit 452e4ed into rust-lang:master Jul 15, 2016
@cuviper cuviper deleted the unicode-9.0 branch September 26, 2017 06:38
Labels
relnotes Marks issues that should be documented in the release notes of the next release. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
9 participants